Abstract


Heap  allocation  with  copying  garbage  collection  is  a  general  storage  management  technique  for 
modern  programming  languages.  It  is  believed  to  have  poor  memory  subsystem  performance. 
To  investigate  this,  we  conducted  an  in-depth  study  of  the  memory  subsystem  performance  of 
heap  allocation  for  memory  subsystems  found  on  many  machines.  We  studied  the  performance  of 
mostly-functional  Standard  ML  programs  which  made  heavy  use  of  heap  allocation.  We  found  that 
most  machines  support  heap  allocation  poorly.  However,  with  the  appropriate  memory  subsystem 
organization,  heap  allocation  can  have  good  performance.  The  memory  subsystem  property  crucial 
for  achieving  good  performance  was  the  ability  to  allocate  and  initialize  a  new  object  into  the  cache 
without a penalty. This can be achieved by having subblock placement with a subblock size of one
word with a write allocate policy, along with fast page-mode writes or a write buffer. For caches
with subblock placement, the data cache overhead was under 9% for a 64K or larger data cache;
without subblock placement the overhead was often higher than 50%.


1  Introduction 


Heap allocation with copying garbage collection is widely believed to have poor memory subsystem
performance [30, 38, 48, 49, 50]. To investigate this, we conducted an extensive study of memory
subsystem performance of heap allocation intensive programs on memory subsystem organizations
typical of many workstations. The programs, compiled with the SML/NJ compiler [4], do tremendous
amounts of heap allocation, allocating one word every 4 to 10 instructions. The programs used
a generational copying garbage collector to manage their heaps. To our surprise, we found that
for some configurations corresponding to actual machines, such as the DECStation 5000/200, the
memory subsystem performance was comparable to that of C and Fortran programs [12]: programs
ran only 3 to 13% slower due to data cache misses than they would have with an infinitely fast
memory. For other configurations, the slowdown due to data cache misses was often higher than
50%.

The memory subsystem features important for achieving good performance with heap allocation
are subblock placement with a subblock size of one word, combined with write-allocate on write-miss,
page-mode writes, and cache sizes of 32K or larger. Heap allocation performs poorly on
machines whose caches are smaller than the allocation area of the programs (256K or larger for
the benchmarks studied here) and do not have one or more of the features mentioned above; this
includes most current workstations.

Our work differs from previous reported work [30, 38, 48, 49, 50] on memory subsystem performance
of heap allocation in two important ways. First, previous work used the overall miss
ratio  as  the  performance  metric,  which  is  a  misleading  indicator  of  performance.  The  overall  miss 
ratio  neglects  the  fact  that  read  and  write  misses  may  have  different  costs.  Also,  the  overall  miss 
ratio  does  not  reflect  the  rates  of  reads  and  writes,  which  may  substantially  affect  performance. 
We  use  memory  subsystem  contribution  to  cycles  per  instruction  (CPI)  as  our  performance  metric, 
which  accurately  reflects  the  effect  of  the  memory  subsystem  on  program  running  time.  Second, 
previous work did not model the entire memory subsystem; it concentrated solely on caches. Memory
subsystem features such as write buffers and page-mode writes interact with the costs of hits
and  misses  in  the  cache  and  should  be  simulated  to  give  a  correct  picture  of  memory  subsystem 
behavior.  We  simulate  the  entire  memory  subsystem. 

We  did  the  study  by  instrumenting  programs  to  produce  traces  of  all  memory  references.  We  fed 
the  references  into  a  memory  subsystem  simulator  which  calculated  a  performance  penalty  due  to 
the memory subsystem. We fixed the architecture to be the MIPS R3000 [28] and varied cache configurations
to cover the design space typical of workstations such as DECStations, SPARCStations,
and HP 9000 series 700. We studied eight substantial programs.

We varied the following memory subsystem parameters: cache size (8K to 512K), cache block
size (16 or 32 bytes), write miss policy (write allocate or write no-allocate), subblock placement
(with and without), associativity (one and two way), TLB sizes (1 to 64 entries), write buffer depth
(1 to 6 deep), and page-mode writes (with and without). We simulated only split instruction and
data caches, i.e., no unified caches. We report data only for write-through caches but the results
extend easily to write-back caches.

Section 2 gives background information. Section 3 describes related work. Section 4 describes
the  simulation  methods,  the  benchmarks,  and  the  metrics  used  to  measure  memory  subsystem 
performance.  Section  5  presents  the  results  of  the  simulation  studies,  an  analysis  of  those  results, 
validation  of  those  results,  and  an  analytical  model  which  is  used  to  extend  the  results  to  programs 
with  different  allocation  behavior.  Section  6  suggests  promising  areas  for  future  work.  Section  7 
concludes. 


2  Background 

The following sections describe memory subsystems, copying garbage collection, SML, and the
SML/NJ compiler.

2.1  Memory  subsystems 

This  section  reviews  the  organization  of  memory  subsystems.  Terminology  for  memory  subsystems 
is not standardized; we use Przybylski's terminology [39].

It is well known that CPUs are getting faster relative to DRAM memory chips [37]; main
memory cannot supply the CPU with instructions and data fast enough. A solution to this problem
is to use a cache, a small fast memory placed between the CPU and main memory that holds a
small subset of memory. If the CPU reads a memory location which is in the cache, the value is
returned quickly. Otherwise the CPU must wait for the value to be fetched from main memory.

Caches work by reducing the average memory access time. This is possible since memory
accesses exhibit spatial and temporal locality. Temporal locality means that a memory location
that  was  referenced  recently  will  probably  be  referenced  again  soon  and  is  thus  worth  storing  in 
the  cache.  Spatial  locality  means  that  a  memory  location  near  one  which  was  referenced  recently 
will  probably  be  referenced  soon.  Thus,  it  is  worth  moving  the  neighboring  locations  to  the  cache. 

2.1.1  Memory  subsystem  organization 

This section describes cache organization for a single level of caching. A cache is divided into blocks,
each of which has an associated tag. A cache block represents a block of memory. The tag for a
cache block indicates what memory block it holds. Cache blocks are grouped into sets. A memory
block may reside in the cache in exactly one set, but may reside in any block within the set. A
cache with sets of size n is said to be n-way associative. If n = 1, the cache is called direct-mapped.
Some caches have valid bits, to indicate what sections of a block hold valid data. A subblock is
the smallest part of a cache with which a valid bit is associated. In this paper, subblock placement
implies a subblock of one word, i.e., valid bits are associated with each word. Moreover, on a read
miss, the whole block is brought into the cache, not just the subblock that missed. Przybylski [39]
notes that this is a good choice.
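
The mapping from an address to a set and a tag can be made concrete with a short sketch. The function name `locate`, its parameters, and the 4-byte word size are assumptions chosen for illustration; they are not taken from the simulator used in this study.

```python
def locate(address, cache_bytes, block_bytes, ways, word_bytes=4):
    """Split a byte address into (tag, set index, word within block).

    Illustrative only: assumes power-of-two sizes and 4-byte words.
    """
    num_sets = cache_bytes // (block_bytes * ways)
    block_number = address // block_bytes      # which memory block is accessed
    set_index = block_number % num_sets        # the one set the block may live in
    tag = block_number // num_sets             # identifies the block within the set
    word_in_block = (address % block_bytes) // word_bytes
    return tag, set_index, word_in_block


# A 64K direct-mapped cache with 16-byte blocks has 4096 sets.
print(locate(0x12345678, 64 * 1024, 16, ways=1))
```

Under these assumptions, two addresses that differ by a multiple of the cache size map to the same set with different tags, which is the situation that produces conflict misses in a direct-mapped cache.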

A memory access to a location which is resident in the cache is called a hit. Otherwise, the
memory access is a miss.

A read request for memory location m causes m to be mapped to a set. All the tags and valid
bits (if any) in the set are checked to see if any block contains m. If a cache block contains m, the
word corresponding to m is selected from the cache block. A read miss is handled by copying the
missing block from the main memory to the cache.

The way write requests are handled depends upon the write policy. The write policy describes
whether writes to the cache go immediately to main memory. In a write-through cache, writes
to the cache immediately go to main memory. In a write-back cache, writes to the cache do not
immediately go to main memory; they are just written to the cache. The writes eventually go
to main memory when a memory block is removed from the cache. Write-back caches use less
bus bandwidth than write-through caches, because multiple writes to the same location may be
coalesced into one write to main memory by the write-back cache, whereas all the writes would
go to main memory with a write-through cache. See [27] for a discussion of the relative merits of
write-back and write-through caches.

A  write  hit  is  always  written  to  the  cache.  There  are  several  policies  for  handling  a  write  miss, 
which  differ  in  their  performance  penalties.  For  each  of  the  policies,  the  actions  taken  on  a  write 
miss  are: 

1. write-no-allocate:

• Do not allocate a block in the cache.

• Send the write to main memory, without putting the write in the cache.

2. write-allocate, no-subblock placement:

•  Allocate  a  block  in  the  cache. 

•  Fetch  the  corresponding  memory  block  from  main  memory. 

•  Write  the  word  to  the  cache  (and  to  memory  if  write  through). 

3.  write-allocate,  subblock  placement1: 

If  the  tag  matches  but  the  valid  bit  is  off: 

•  Write  the  word  to  the  cache  (and  to  memory  if  write  through). 

If  the  tag  does  not  match: 

•  Allocate  a  block  in  the  cache. 

•  Write  the  word  to  the  cache  (and  to  memory  if  write  through). 

•  Invalidate  the  remaining  words  in  the  block. 

Write allocate/subblock placement will have a lower write miss penalty than write allocate/no
subblock placement since it avoids fetching a memory block from main memory. In addition, it
will have a lower penalty than write no-allocate if the written word is read before being evicted
from the cache. See Jouppi [27] for more information on write-miss policies.
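
The difference between the three policies on sequential allocation can be seen in a toy model that counts the block fetches triggered by write misses. The function, its policy names, and the tiny direct-mapped cache are hypothetical simplifications for illustration; the sketch ignores reads, write buffers, and timing.

```python
def write_miss_fetches(policy, addresses, num_blocks=4, block_words=4):
    """Count block fetches from main memory caused by a sequence of word writes.

    policy is "no-allocate", "allocate", or "allocate-subblock".
    addresses are word addresses; the modeled cache is direct-mapped.
    """
    tags = [None] * num_blocks          # memory block held by each cache block
    fetches = 0
    for addr in addresses:
        block = addr // block_words
        index = block % num_blocks
        if tags[index] == block:
            continue                    # tag matches: the word is written in place
        if policy == "no-allocate":
            continue                    # write goes to memory only; cache unchanged
        tags[index] = block             # allocate a cache block for the written word
        if policy == "allocate":
            fetches += 1                # fetch the rest of the block from memory
        # "allocate-subblock": the other words are marked invalid, so no fetch
    return fetches


# Sequentially initializing 32 fresh words, as heap allocation does.
for policy in ("no-allocate", "allocate", "allocate-subblock"):
    print(policy, write_miss_fetches(policy, range(32)))
```

In this model only write allocate without subblock placement pays a fetch for every newly allocated block. Write no-allocate also avoids the fetch, but it leaves the new object out of the cache, so a later read of that object would miss.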

A miss is a compulsory miss if it is due to a memory block being accessed for the first time. A
miss is a capacity miss if it results from the cache not being large enough to hold all the memory
blocks used by a program. The capacity misses for a given cache size correspond to the misses in
a fully associative cache of the same size with an LRU replacement policy minus the compulsory
misses. It is a conflict miss if it results from two memory blocks mapping to the same set [25].

The  memory  subsystem  bandwidth  may  be  increased  by  using  separate  caches  for  instructions 
and  data.  This  is  called  a  split  instruction-data  cache.  The  memory  bandwidth  is  increased  since  a 
data  access  and  an  instruction  fetch  may  be  handled  at  the  same  time.  A  cache  where  instructions 
and data go to the same cache is called a unified cache. This paper presents results only for split
instruction-data  caches. 

A write buffer may be used to reduce the cost of writes to main memory. A write buffer is a
queue  containing  writes  that  are  to  be  sent  to  main  memory.  When  the  CPU  does  a  write,  the 
write  is  placed  in  the  write  buffer  and  the  CPU  continues  without  waiting  for  the  write  to  finish. 
The  write  buffer  retires  entries  to  main  memory  using  free  memory  cycles.  There  are  situations 
when  the  write  buffer  is  not  fully  effective  in  preventing  stalls  on  writes  to  main  memory.  First,  if 
the  CPU  writes  to  a  full  write  buffer,  the  CPU  must  wait  for  an  entry  to  become  available  in  the 
write  buffer.  Second,  if  the  CPU  reads  a  location  which  is  queued  up  in  the  write  buffer,  the  CPU 
may  need  to  wait  until  the  write  buffer  is  empty.  Third,  if  the  CPU  issues  a  read  to  main  memory 
while  a  write  is  in  progress,  the  CPU  must  wait  for  the  write  to  finish. 
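
The first of these situations, a write to a full buffer, can be sketched with a small queue model. Everything here is an assumption made for illustration: the function name, the fixed retire time per write, and the omission of reads, page mode, and refreshes. It is not the write buffer simulator described later in this paper.

```python
from collections import deque


def write_stall_cycles(write_times, depth, retire_cycles):
    """Cycles the CPU stalls because it writes to a full write buffer.

    write_times: cycle at which each write would issue if there were no stalls.
    depth: number of entries in the write buffer.
    retire_cycles: cycles main memory needs to retire one write.
    """
    finish = deque()    # completion times of writes still held in the buffer
    stall = 0           # total stall cycles accumulated so far
    last_done = 0       # time at which memory finishes the newest queued write
    for t in write_times:
        now = t + stall
        while finish and finish[0] <= now:
            finish.popleft()                # retired entries free their slots
        if len(finish) == depth:            # buffer full: wait for the oldest entry
            wait = finish.popleft() - now
            stall += wait
            now += wait
        last_done = max(last_done, now) + retire_cycles
        finish.append(last_done)
    return stall


# Four back-to-back writes, 5 cycles to retire each write.
print(write_stall_cycles([0, 1, 2, 3], depth=2, retire_cycles=5))
print(write_stall_cycles([0, 1, 2, 3], depth=6, retire_cycles=5))
```

In this model a deeper buffer absorbs a burst of writes, and a faster retire time (for example from page-mode writes) keeps the buffer from filling in the first place.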

Main memory is divided into DRAM pages. Page-mode writes reduce the latency of writes to
the same DRAM page when there are no intervening memory accesses to another DRAM page.
Page-mode writes work as follows. DRAMs are organized internally as arrays, and all the locations
on a DRAM page reside on the same row in the DRAMs which implement main memory. This fact
can be used to speed up a sequence of writes to one DRAM page. A DRAM is updated in a read-modify-write
cycle: an array row is latched into a row buffer, the row buffer is modified, and then
written back to the array. A sequence of writes to the same DRAM page can update the row while
it is held in the row buffer, and avoid the read and write cycles for all but the first and last writes,
respectively. This improves write speed significantly. For example, on a DECStation 5000/200, a
non-page-mode write takes 5 cycles, while a page-mode write takes 1 cycle. Main memory is said


1 Recall subblock size is assumed to be 1 word.


% check for heap overflow
cmp alloc+12,top
branch-if-gt call-gc
% write the object
store tag,(alloc)
store ra,4(alloc)
store rd,8(alloc)
% save pointer to object
move alloc+4,result
% add 12 to alloc pointer
add alloc,12


Figure  1:  Pseudo-assembly  code  for  allocating  an  object 


to  be  operating  in  page  mode  when  DRAM  rows  are  held  in  row  buffers  across  memory  accesses. 
It is thrown out of page mode when a memory access to a different DRAM page is made. It may
also  be  thrown  out  of  page  mode  for  other  machine-specific  reasons  (such  as  refreshes).  Page-mode 
writes  are  especially  effective  at  handling  writes  with  high  spatial  locality,  such  as  those  seen  when 
saving  registers  at  a  procedure  call  or  when  doing  sequential  allocation. 
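
The benefit for sequential writes can be estimated with a small cost model. The 5-cycle and 1-cycle costs are the DECStation 5000/200 figures quoted in this section; the 4096-byte DRAM page size and the function name are assumptions for illustration, and refreshes and intervening reads are ignored.

```python
def write_cycles(addresses, page_bytes=4096, page_mode_cost=1, normal_cost=5):
    """Total cycles to retire a sequence of writes (byte addresses) to main memory."""
    total = 0
    open_page = None                # DRAM page currently held in the row buffer
    for addr in addresses:
        page = addr // page_bytes
        # A write to the open page runs in page mode; any other write does not.
        total += page_mode_cost if page == open_page else normal_cost
        open_page = page
    return total


# Initializing a three-word object at consecutive addresses.
print(write_cycles([0, 4, 8]))
# The same number of writes alternating between two DRAM pages.
print(write_cycles([0, 4096, 4]))
```

Under these assumptions the three sequential writes cost 7 cycles, while three writes that alternate between pages cost 15.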

2.1.2  Memory  subsystem  performance 

This  section  describes  two  metrics  for  measuring  the  performance  of  memory  subsystems.  One 
popular  metric  is  the  cache  miss  ratio.  The  cache  miss  ratio  is  the  number  of  memory  accesses 
which  miss  divided  by  the  total  number  of  memory  accesses.  Since  different  kinds  of  memory 
accesses  usually  have  different  miss  costs,  it  is  useful  to  have  miss  ratios  for  each  kind  of  access. 

Cache  miss  ratios  alone  do  not  measure  the  impact  of  the  memory  subsystem  on  overall  system 
performance.  A  metric  which  better  measures  this  is  the  contribution  of  the  memory  subsystem  to 
CPI (cycles per useful instruction2). CPI is calculated for a program as number of CPU cycles to
complete the program / total number of useful instructions executed. It measures how efficiently the
CPU is being utilized. The contribution of the memory subsystem to CPI is calculated as number of
CPU cycles spent waiting for the memory subsystem / total number of useful instructions executed.
As an example, on a DECStation 5000/200, the lowest CPI possible is 1, completing one instruction
per cycle. If the CPI for a program is 1.50, and the memory contribution to CPI is 0.3, 20% (0.3/1.5)
of  the  CPU  cycles  are  spent  waiting  for  the  memory  subsystem  (the  rest  may  be  due  to  other  causes 
such  as  nops,  multi-cycle  instructions  like  integer  division,  etc.).  CPI  is  machine  dependent  since 
it  is  calculated  using  actual  penalties. 
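
The two definitions can be written out directly. The function names and the instruction and cycle counts below are made up for illustration; only the resulting ratios (a memory contribution of 0.3 against a CPI of 1.50) come from the example in the text.

```python
def memory_cpi(memory_stall_cycles, useful_instructions):
    """Contribution of the memory subsystem to CPI."""
    return memory_stall_cycles / useful_instructions


def fraction_of_time_waiting(memory_cpi_value, total_cpi):
    """Share of all CPU cycles spent waiting for the memory subsystem."""
    return memory_cpi_value / total_cpi


# Hypothetical run: 1,000,000 useful instructions in 1,500,000 cycles,
# of which 300,000 cycles were spent waiting for memory.
instructions = 1_000_000
total_cpi = 1_500_000 / instructions
mem = memory_cpi(300_000, instructions)
print(total_cpi, mem, fraction_of_time_waiting(mem, total_cpi))
```

The remaining cycles above the ideal CPI of 1 in this hypothetical run would be attributed to non-memory causes such as nops and multi-cycle instructions.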

2.2  Copying  garbage  collection 

A copying garbage collector [22, 14] reclaims an area of memory by copying all the live (non-garbage)
data to another area of memory. This means that all data in the garbage-collected area
is  now  garbage,  and  the  area  can  be  re-used.  Since  memory  is  always  reclaimed  in  large  contiguous 
areas,  objects  can  be  sequentially  allocated  from  such  areas  at  the  cost  of  only  a  few  instructions. 
Figure 1 gives an example of pseudo-assembly code for allocating a cons cell; ra contains the car
cell contents, rd contains the cdr cell contents, alloc is the address of the next free word in the
allocation area, and top contains the end of the allocation area.
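
The allocation sequence of Figure 1 can be mirrored in a higher-level sketch. The class `AllocationArea`, its dictionary-based memory, and the use of an exception in place of the call to the garbage collector are assumptions for illustration; the 12-byte object layout (tag, car, cdr) and the returned pointer follow the figure.

```python
class AllocationArea:
    """Sequential (bump-pointer) allocation of three-word cons cells."""

    def __init__(self, size_bytes):
        self.memory = {}         # byte address -> stored word (sparse model)
        self.alloc = 0           # address of the next free word
        self.top = size_bytes    # end of the allocation area

    def cons(self, tag, car, cdr):
        if self.alloc + 12 > self.top:          # check for heap overflow
            raise MemoryError("allocation area full: garbage collection needed")
        self.memory[self.alloc] = tag           # write the object
        self.memory[self.alloc + 4] = car
        self.memory[self.alloc + 8] = cdr
        result = self.alloc + 4                 # pointer to the object, past the tag
        self.alloc += 12                        # add 12 to the allocation pointer
        return result


area = AllocationArea(24)
first = area.cons("cons", 1, None)
second = area.cons("cons", 2, first)
print(first, second, area.alloc)
```

Each allocation touches the next three consecutive words, which is why the writes it generates have high spatial locality but always land on memory that has not been used since the last collection.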


2 All instructions besides nops are considered as useful. A nop (null operation) instruction is a software-controlled
pipeline stall.




The SML/NJ compiler uses a simple generational copying garbage collector [2]. Memory is
divided into an old generation and an allocation area. New objects are created in the allocation
area; garbage collection copies the live objects in the allocation area to the old generation, freeing up
the allocation area. Generational garbage collection relies on the fact that most allocated objects
die young; thus most objects (about 99% [4, p. 206]) are not copied from the allocation area. This
makes the garbage collector efficient, since it works mostly on an area of memory where it is very
effective at reclaiming space.

The  most  important  property  of  a  copying  collector  with  respect  to  memory  subsystem  behavior 
is  that  allocation  initializes  memory  which  has  not  been  touched  in  a  long  time  and  is  thus  unlikely 
to  be  in  the  cache.  This  is  especially  true  if  the  allocation  area  is  large  relative  to  the  size  of  the 
cache since allocation will knock everything out of the cache. This means that caches which cannot
hold  the  allocation  area  will  incur  a  large  number  of  write  misses. 

For example, consider the code in Figure 1. Assume that a cache write miss costs 16 CPU cycles
and that the block size is 4 words. On average, every fourth word allocated causes a write miss.
Thus, the average memory subsystem cost of allocating a word on the heap is 4 cycles. The average
cost for allocating a cons cell is seven cycles (at one cycle per instruction) plus 12 cycles for the
memory subsystem overhead. Thus, while allocation is cheap in terms of instruction counts, it may
be expensive in terms of machine cycle counts.
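
The arithmetic of this example can be checked with a few lines. The function name and parameter names are illustrative, and the sketch assumes a 16-cycle write miss penalty, 4-word blocks, and one miss per newly allocated block.

```python
def allocation_cost(object_words, instructions, miss_penalty=16, block_words=4):
    """Average cycles to allocate one object, split into total and memory overhead."""
    cycles_per_word = miss_penalty / block_words   # one write miss per block_words words
    memory_overhead = object_words * cycles_per_word
    return instructions + memory_overhead, memory_overhead


# A cons cell is 3 words (tag, car, cdr) and takes 7 instructions to allocate.
total, overhead = allocation_cost(object_words=3, instructions=7)
print(total, overhead)
```

With these assumptions the memory overhead (12 cycles) exceeds the instruction cost (7 cycles), so the write miss penalty dominates the cost of allocation.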

2.3  Standard  ML 

Standard  ML  (SML)  [35]  is  a  call-by-value,  lexically  scoped  language  with  higher-order  functions, 
with many of the features deemed good by the programming language community. It has garbage
collection  to  automate  the  management  of  heap  storage.  This  eliminates  two  common  kinds  of 
programming  errors  that  occur  with  explicit  storage  management,  memory  leaks  and  dangling 
pointers.  Memory  leaks  occur  when  memory  is  never  deallocated,  and  dangling  pointers  occur 
when  memory  is  deallocated  too  soon.  SML  is  statically  typed,  so  many  programming  errors  are 
caught  at  compile-time.  The  type  system  is  polymorphic,  and  types  are  inferred  automatically 
by  the  compiler,  so  the  type  system  is  flexible  yet  not  an  impediment  to  the  programmer.  The 
language  is  provably  safe,  that  is,  there  are  no  holes  in  the  type  system  and  a  program  always  has  a 
well-defined  behavior.  SML  has  a  sophisticated  module  system  to  support  the  development  of  large 
programs.  The  module  system  provides  for  static  type-checking  of  the  interfaces  between  modules, 
as  in  Ada  and  Modula-3.  It  has  a  dynamically-scoped  exception  mechanism  to  allow  programs  to 
handle  unusual  conditions. 

SML  encourages  a  non-imperative  programming  style.  Variables  cannot  be  altered  once  they 
are bound, and by default data structures cannot be altered once they are created. Lisp’s rplaca
and rplacd do not exist for the default definition of lists in SML. The only kinds of assignable data
structures are ref cells and arrays3, which must be explicitly declared. To emphasize the point,
assignments are permitted but discouraged as a general programming style. The implications of
this non-imperative programming style for compilation are clear: SML programs tend to do more
allocation  and  copying  than  programs  written  in  imperative  languages. 

SML is most closely related to Lisp and Scheme [41]. Implementation techniques for one of these
languages are mostly applicable to the other languages, with the following caveats: SML programs
tend to be less imperative than Lisp or Scheme programs and Scheme and SML programs use
function calls more frequently than Lisp, since recursion is the usual way to achieve iteration in
those  languages. 

2.4  SML/NJ  compiler 

The SML/NJ compiler [4] is a publicly available compiler for SML. We used version 0.91. The
compiler concentrates on making allocation cheap and function calls fast. Allocation is done in-line,
except for the allocation of arrays. Aggressive function inlining is used to eliminate function
calls and their associated overhead. Function arguments are passed in registers when possible,
and register targeting is used to minimize register shuffling at function calls. A split caller/callee-save
register convention is used to avoid excessive spilling of registers [X]. The compiler also does
constant-folding, limited code hoisting, uncurrying, and instruction scheduling.

3 Although the language definition omits arrays, all implementations have arrays.

The  most  controversial  design  decision  in  the  compiler  was  to  allocate  procedure  activation 
records on the heap instead of the stack [1, 6]. In principle, the presence of higher-order functions
means  that  procedure  activation  records  must  be  allocated  on  the  heap.  With  a  suitable  analysis, 
a  stack  can  be  used  to  store  most  activation  records  [31].  However,  using  only  a  heap  simplifies 
the compiler, the run-time system [3], and the implementation of first-class continuations [2d].
The  decision  to  use  only  a  heap  was  controversial  because  it  greatly  increases  the  amount  of  heap 
allocation,  which  is  believed  to  cause  poor  memory  subsystem  performance. 


3  Related  Work 

There  have  been  many  studies  of  the  cache  behavior  of  systems  using  heap  allocation  and  some 
form of copying garbage collection. Peng and Sohi [38] examined the data cache behavior of small
Lisp programs. They used trace-driven simulation, and proposed an ALLOCATE instruction for
improving cache behavior, which allocates a block in the cache without fetching it from memory.
Wilson et al. [48, 49] argued that cache performance of programs with generational garbage collection
will improve substantially when the youngest generation fits in the cache. Koopman et
al. [30] studied the effect of cache organization on combinator graph reduction, an implementation
technique for lazy functional programming languages. They observed the importance of a
write-allocate policy with subblock placement for improving heap allocation. Zorn [50] studied
the impact of cache behavior on the performance of a Common Lisp system, when stop-and-copy
and mark-and-sweep garbage collection algorithms were used. He concluded that when programs
are run with mark-and-sweep they have substantially better cache locality than when run with
stop-and-copy.

Our  work  differs  from  previous  work  in  two  important  ways.  First,  previous  work  used  the 
overall  miss  ratio  as  the  performance  metric,  which  is  a  misleading  indicator  of  performance.  The 
overall  miss  ratio  neglects  the  fact  that  read  and  write  misses  may  have  different  costs.  Also,  the 
overall  miss  ratio  does  not  reflect  the  rates  of  reads  and  writes,  which  may  substantially  affect 
performance.  We  use  memory  subsystem  contribution  to  CPI  as  our  performance  metric,  which 
accurately  reflects  the  effect  of  the  memory  subsystem  on  program  running  time.  Second,  previous 
work  did  not  model  the  entire  memory  subsystem:  it  concentrated  solely  on  caches.  Memory 
subsystem  features  such  as  write  buffers  and  page-mode  writes  interact  with  the  costs  of  hits  and 
misses  in  the  cache  and  should  be  simulated  to  give  a  correct  picture  of  memory  subsystem  behavior. 
We  simulate  the  entire  memory  subsystem. 

Appel [4] estimated CPI for the SML/NJ system on a single machine using elapsed time and
instruction counts. His CPI differs substantially from ours. Apparently instructions were undercounted
in his measurements [5].

Jouppi [27] studied the effect of cache write policies on the performance of C and Fortran
programs. Our class of programs is different from his, but his conclusions support ours: that a
write-allocate policy with subblock placement is a desirable architecture feature. He found that the
write miss ratio for the programs he studied was comparable to the read miss ratio, and that write-allocate
with subblock placement eliminated many of the write misses. For programs compiled
with the SML/NJ compiler, this is even more important due to the high number of write misses
caused by allocation.


4  Methodology 

We  used  trace  driven  simulations  to  evaluate  the  memory  subsystem  performance  of  programs 
compiled  with  the  SML/NJ  compiler.  For  trace  driven  simulations  to  be  useful,  there  must  be  an 
accurate  simulation  model  and  a  good  selection  of  benchmarks.  Simulations  that  make  simplifying 
assumptions  about  important  aspects  of  the  system  being  modeled  can  yield  misleading  results.  Toy 
benchmarks, or benchmarks that are not representative of the kinds of tasks the system is normally
used  for,  can  be  equally  misleading.  In  this  work,  much  effort  has  been  devoted  to  addressing  these 
issues. 

Section 4.1 describes our trace generation and simulation tools. Section 4.2 states our assumptions
and argues that they are reasonable. Section 4.3 describes and characterizes the benchmark
programs used in this study. Section 4.4 describes the metrics used to present memory subsystem
performance.

4.1  Tools 

We extended QPT (Quick Program Profiler and Tracer) [43, 0, 42] to produce memory traces for
SML/NJ programs. QPT rewrites an executable program to produce compressed trace information;
QPT also produces a program-specific regeneration program that expands the compressed trace
into a full trace. Because QPT operates on the executable program, it can trace both the SML code
and the garbage collector (which is written in C). The significant trace compression achieved by
QPT allowed us to send traces to faster machines where they could be regenerated and simulated
quickly: about 50 μs to regenerate and simulate each memory reference on an HP 9000 model 720
machine.

Code produced by the SML/NJ compiler presents three problems for QPT. First, SML/NJ puts
its code in the heap. Since SML/NJ uses a copying collector, code can be moved just like data.
This creates numerous problems; we solve them by putting SML/NJ code in the text segment, so it
is never garbage collected. Second, programs compiled with the SML/NJ compiler have no symbol
table information. SML/NJ makes the problem worse by interleaving data with the code. QPT
needs a symbol table to find all the code. Third, SML/NJ often implements function calls using
indirect jumps. QPT needs to know all the program points that could be targets of an indirect
jump. We solved the second and third problems by modifying SML/NJ to produce tables that enable
QPT to find all targets of indirect jumps and to separate code from data; we enhanced QPT to use this
information.

We used Tycho [24] for the memory subsystem simulations. Tycho uses a special case of all-associativity
simulation [44] to simulate multiple caches concurrently. We extended Tycho in four
important ways. First, we extended Tycho to separate read misses from write misses. Second, we
changed  Tycho  to  simulate  separate  data  and  instruction  caches  simultaneously.  Third,  we  added 
a  write  buffer  simulator  to  Tycho.  The  write  buffer  simulator  can  concurrently  simulate  a  write 
buffer  for  each  cache  organization  being  simulated  by  Tycho.  The  write  buffer  simulator  also  takes 
page-mode  writes  and  memory  refreshes  into  consideration.  Fourth,  we  added  the  write  no  allocate 
write  miss  policy  to  Tycho. 

We  obtained  allocation  statistics  by  using  an  allocation  profiler  built  into  SML/NJ.  The  profiler 
instruments intermediate code to increment appropriate elements of a count array on every allocation.
We extended this profiler to count the number of assignments done by SML/NJ programs.

4.2  Simplifications  and  Assumptions 

We  wanted  to  simulate  the  memory  systems  as  completely  as  we  could.  Thus,  we  tried  to  minimize 
assumptions  which  might  reduce  the  validity  of  our  data.  This  section  describes  all  the  important 
assumptions made in this study and argues that they are reasonable.

1. Simulating write allocate/subblock placement with write allocate/no subblock placement. Tycho
does not simulate subblock placement, so we approximate it by simulating write allocate/no
subblock and ignoring the reads from memory that occur on a write miss. This can
cause a small inaccuracy in the CPI numbers. The following example illustrates
when the simplification fails.

Let us suppose we have a cache block size of 2 words and a subblock size of 1 word. The
program issues a write to the first word. Further assume that the write misses. With subblock
placement, the word will be written to the cache and the second word in the block
will be invalidated. However, the simplified model will mark both words as valid after the
write. If the program subsequently issues a read of the second word, the read will be
regarded as a hit. Thus the CPI reported for caches with subblock placement can be less
than the actual CPI. This is however a rare occurrence since SML/NJ programs do few
assignments (see Section 4.3) and most writes are to sequential locations.

2. Ignoring the effects of context switches and system calls. Context switches (including those
caused by system calls) can affect cache performance significantly. We ignore them since
it is an operating system issue that affects all programs, not just programs that are allocation
intensive.

3. Pessimistic simulation of partial word writes. Most memory subsystems use a word as the
smallest addressable unit and also maintain error checking information on a per-word basis.
Thus, writes to partial words (bytes, half-words, etc.) are more expensive than full-word writes,
since the enclosing word needs to be read, modified, its error checking information updated, and
written back. We charge 11 cycles for each partial-word write regardless of whether the word
is in the cache. If the word is not in the cache, the cache block is not fetched from memory.
Also, the write is not queued up in the write buffer. This is mostly consistent with the
DECStation 5000/200 model of partial word writes: the key difference is that we are always
assuming the worst case scenario (which is probably rare in practice).

This inaccuracy, however, does not have any significant impact on the accuracy of the
simulations: the CPI contribution of partial word writes is negligible even with this pessimistic
model (see Section 5).

4. The simulations are driven by virtual addresses. The caches in many current machines are
physically indexed (notable exceptions are the SPARCs and HP series 700). This can be a
problem since the virtual address to physical address mapping can affect the conflicts in the
cache. However, some virtual to physical mapping schemes (e.g., a variation of Page Coloring
used in the MIPS operating system) yield similar intra-process cache conflicts as if the cache
was virtually indexed [29]. Thus, the simplification is reasonable.

5. Placing code in the text segment instead of the heap. This improves performance over the
unmodified SML/NJ system by reducing garbage collection overhead, since code is never
copied, and by avoiding instruction cache flushes after garbage collections.

6. Used default compilation settings for SML/NJ. Default compilation settings enable extensive
optimization. Evaluating the impact of these optimizations on cache behavior is beyond the
scope of this paper.

7. Used default garbage collection settings.

We used the default strategy for sizing the allocation area and the old generation [2]. The
heap is sized as r times the size of the old generation after the old generation is collected,
where r is the desired ratio of heap size to live data. r=5 was used for all the program runs.
The allocation area is sized as one-half of the free space (the heap space not occupied by the
old generation). As the old generation grows after each collection of the allocation area, the
free space decreases and the allocation area decreases. This continues until the old generation
is collected.

We  did  not  investigate  the  interaction  of  the  sizing  strategy  and  cache  size  [If)].  When  the 
allocation area is larger than the cache, it may be possible to improve program locality by
decreasing the size of the allocation area so that it fits in the cache. However, this would
probably increase garbage collection costs. Understanding these tradeoffs is beyond the scope
of this paper.

In addition to the ratio, the garbage collector is controlled by the softmax and the initial heap
size. The softmax is a desired upper limit on the heap size which is exceeded only to prevent
programs from running out of space. The softmax was 20M: the benchmark programs never
reached this limit and were always able to resize their heaps to maintain the desired ratio of
5. The initial heap size was 1M.

8. MIPS as a prototypical RISC machine. All the traces are for the DECStation 5000/200,
which uses a MIPS R3000 CPU. The results should carry over to other RISC machines but
we do not know how applicable the results are to CISC machines.

9. All instructions take one cycle with a perfect memory subsystem. On the DECStation 5000/200,
this is not true for some instructions (such as multiply, etc.). As far as the memory subsystem
performance is concerned, multi-cycle instructions change only the write buffer penalties:
multi-cycle instructions can give the write buffer more opportunities to retire writes. Section
5.4 shows that the write-buffer overhead is small: thus the inaccuracy introduced by this
assumption will be negligible.

10. Assuming CPU cycle time does not vary with memory organization. The CPI calculations
assume that the CPU cycle time remains the same for different memory organizations. This
may not be the case, since the CPU cycle time depends on the cache access time, which may
be different for different cache organizations. For example, a 128K cache may take longer to
access than an 8K cache.
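
The failure case described under assumption 1 can be illustrated with a few lines of Python. This is only a sketch of the two-word example from the text, not the Tycho simulator: the function name and the list of per-word valid bits are illustrative assumptions.

```python
def read_second_word_after_write_miss(true_subblocks: bool) -> str:
    """Model a single 2-word cache block that starts out absent from the cache.

    The program writes word 0 (a write miss) and then reads word 1.
    Returns "hit" or "miss" for that read.
    """
    if true_subblocks:
        # Real subblock placement: only the written word becomes valid,
        # the other word in the block is invalidated.
        valid = [True, False]
    else:
        # Simplified model: write allocate/no subblock with the memory read
        # ignored, so both words are treated as valid after the write.
        valid = [True, True]
    return "hit" if valid[1] else "miss"


print(read_second_word_after_write_miss(True))   # miss
print(read_second_word_after_write_miss(False))  # hit
```

The simplified model reports a hit where real subblock placement would miss, which is why it can understate the CPI in this (rare) access pattern.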

4.3  Benchmarks 

Table 1 describes the benchmark programs⁵. Knuth-Bendix, Lexgen, Life, Simple, VLIW, and
YACC are identical to the benchmarks measured by Appel [1]⁶. Table 2 gives the sizes of the
benchmarks in terms of lines of SML code (excluding comments and blank lines), maximum heap
size in kilobytes, size of the compiled code in kilobytes (does not include the garbage collector and
other run-time support code, which is about 60K)⁷, and run time, in seconds, on a DECStation
5000/200. The run times are the minimum of five runs (see Section 5.6).

Table  3  characterizes  the  benchmark  programs  according  to  the  number  and  kinds  of  memory 
references  they  do.  All  numbers  are  reported  as  a  percentage  of  instruct  ions.  The  Raids.  Writes. 
and  Partial  writes  columns  list  the  reads,  full- word  writes,  and  partial-word  writes  done  by  the 
program  and  the  garbage  collector:  the  assiynnicnts  column  lists  the  non-initializing  writes  done 
by  the  program  only.  The  Nops  column  lists  the  nops  executed  by  the  program  and  the  garbage 
collector.  Note  that  all  the  benchmarks  have  long  traces:  most  related  works  use  traces  that  are 
an  order  of  magnitude  smaller.  Also,  note  that  the  benchmark  programs  do  few  assignments:  the 
majority  of  the  writes  are  initializing  writes. 


⁵Available from the authors.

⁶The descriptions of these benchmarks have been copied from [1].
⁷The code size includes JOT K for the standard libraries.

Program        Description

CW             The Concurrency Workbench [15] is a tool for analyzing networks
               of finite state processes expressed in Milner's Calculus of
               Communicating Systems. The input is the sample session from
               Section 7.5 of [13].

Knuth-Bendix   An implementation of the Knuth-Bendix completion algorithm,
               implemented by Gerard Huet, processing some axioms of geometry.

Lexgen         A lexical-analyzer generator, implemented by James S. Mattson and
               David R. Tarditi [7], processing the lexical description of
               Standard ML.

Life           The game of Life, written by Chris Reade [40], running 50
               generations of a glider gun. It is implemented using lists.

PIA            The Perspective Inversion Algorithm [17] decides the location of
               an object in a perspective video image.

Simple         A spherical fluid-dynamics program, developed as a "realistic"
               FORTRAN benchmark [16], translated into ID [21], and then
               translated into Standard ML by Lal George.

VLIW           A Very-Long-Instruction-Word instruction scheduler written by
               John Danskin.

YACC           A LALR(1) parser generator, implemented by David R. Tarditi [II],
               processing the grammar of Standard ML.

Table 1: Benchmark Programs


Program        Size (lines)   Heap size (K)   Code size (K)

CW                     5728            1107             894
Knuth-Bendix            491            2708             251
Lexgen                 1224            2162             105
Life                    111            1026             221
PIA                    1454            1025             291
Simple                  999           11571             114
VLIW                                   1088             486
YACC                   5751            1612             580

Table 2: Sizes of Benchmark Programs

Program        Inst Fetches   Reads (%)  Writes (%)  Partial Writes (%)  Assignments (%)  Nops (%)

CW             523,245,987    17.61      11.61       0.01                0.41             13.24
Knuth-Bendix   512,086,138    19.66      22.31       0.00                0.00              5.92
Lexgen         328,422,283    16.08      10.41       0.20                0.21             12.33
Life           413,536,662    12.18       9.26       0.00                0.00             15.45
PIA            122,215,151    25.27      16.50       0.00                0.00              8.39
Simple         604,611,016    23.86      14.06       0.00                0.05              7.58
VLIW           399,812,033    17.89      15.99       0.10                0.77              9.04
YACC           133,043,324    18.49      14.66       0.32                0.38             11.14

Table 3: Characteristics of benchmark programs


Program        Allocation (words)   % and Size for each allocation kind

CW                                  4.12  3.3  ?  6.20  19.5  1.01  6.0  4.00
Knuth-Bendix                        6.60  0.1  ?  49.5  4.90  ?  1.00  0.1  15.05
Lexgen                              6.20  5.4  12.96  ?  6.10  1.00  3.7  6.97
Life           37,840,681           0.2  3.45  0.0  15.00  77.8  5.52  ?  ?  10.29
PIA            18,841,256           0.4  1.69  12.7  1.41  33.9  5.22
Simple         80,761,644           1.0  6.43  8.3  1.00  18.5  5.41
VLIW           59,497,132           9.9  ?  61.8  7.67  20.3  1.01  2.1  2.60
YACC           17,015,250           2.3  4.83  ?  15.35  54.8  7.14  23.7  ?  4.0  10.22

Table 4: Allocation characteristics of benchmark programs


Table 4 gives the allocation statistics for each benchmark program. All allocation and sizes are
reported in words. The Allocation column lists the total allocation done by the benchmark. The
remaining columns break down the allocation by kind: closures for escaping functions, closures for
known functions, closures for callee-save continuations⁸, records, and others (includes spill records,
arrays, strings, vectors, ref cells, store list records, and floating point numbers). For each allocation
kind, the % column gives the total words allocated for objects of that kind as a percentage of total
allocation and the Size column gives the average size in words, including the 1 word tag, of an
object of that kind.

4.4  Metrics 

Following the lead of recent work on memory subsystem performance, we state cache performance
numbers in cycles per useful instruction (CPI). All instructions besides nops are considered useful.
Unlike miss ratios, CPI numbers give an indication of how fast a program will run. On the down
side, CPI numbers are machine dependent because actual penalties are used in their calculations.
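
As a rough illustration of the metric, the sketch below computes cycles per useful instruction under the paper's assumption that every instruction takes one cycle when the memory subsystem is perfect. The function name, its parameters, and the trace numbers are hypothetical and chosen only for the example.

```python
def cpi(total_instructions: int, nops: int, stall_cycles: int = 0) -> float:
    """Cycles per useful instruction.

    Assumes one cycle per executed instruction (including nops) plus any
    stall cycles charged by the memory subsystem.
    """
    useful = total_instructions - nops
    return (total_instructions + stall_cycles) / useful


# Hypothetical trace: 1000 instructions, 100 of them nops.
print(cpi(1000, 100))       # perfect memory subsystem: 1000 / 900
print(cpi(1000, 100, 450))  # with 450 stall cycles: 1450 / 900
```

With no stall cycles the result is the total number of instructions divided by the number of useful instructions, which is the base CPI that nops alone impose.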

Table 5 lists the penalties used in our simulations. These numbers are derived from the penalties
for the DECStation 5000/200, but are similar to those in other machines of the same class. Writes
have different penalties depending on whether or not subblock placement is being used, the block
size (and thus the fetch size), and whether the writes hit or miss in the cache. For caches with
subblock placement, write hits or misses have no penalty (besides write buffer related costs)⁹. For


⁸Closures for callee-save continuations can be trivially allocated on a stack in the absence of first-class
continuations.

⁹In an actual implementation, the penalty of a miss may be one cycle since, unlike hits, the tag needs to be written


Task                                   Penalty (in cycles)

Non-page-mode write                      5
Page-mode write                          1
Page-mode flush                          4
Read 16 bytes from memory               15
Read 32 bytes from memory               19
Refresh period                         195
Refresh time                             5
Write hit or miss (subblocks)            0
Write hit (16 bytes, no subblocks)       0
Write hit (32 bytes, no subblocks)       0
Write miss (16 bytes, no subblocks)     15
Write miss (32 bytes, no subblocks)     19

Table 5: Penalties of memory operations


caches without subblock placement, write hits have no penalty (besides write buffer related costs)
but write misses cost 15 or 19 cycles (plus write buffer penalties) for block sizes of 16 and 32 bytes
respectively. The read miss and instruction fetch miss penalty depends on the block size: it is 15
cycles for a block size of 16 bytes and 19 cycles for a block size of 32 bytes.
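
The write and read penalties just described can be written down as a small lookup. This is a sketch of the rules in Table 5 only; it deliberately leaves out write buffer costs, and the names `BLOCK_READ_CYCLES`, `write_penalty`, and `read_miss_penalty` are assumptions made for the example.

```python
# Cycles needed to read one block from memory, keyed by block size in bytes
# (values taken from Table 5).
BLOCK_READ_CYCLES = {16: 15, 32: 19}


def write_penalty(hit: bool, block_bytes: int, subblocks: bool) -> int:
    """Penalty in cycles for one write, excluding write buffer costs."""
    if subblocks or hit:
        # With subblock placement neither hits nor misses stall;
        # without it, only misses do.
        return 0
    return BLOCK_READ_CYCLES[block_bytes]


def read_miss_penalty(block_bytes: int) -> int:
    """Penalty in cycles for a read miss or an instruction fetch miss."""
    return BLOCK_READ_CYCLES[block_bytes]


print(write_penalty(hit=False, block_bytes=16, subblocks=False))  # 15
print(write_penalty(hit=False, block_bytes=32, subblocks=True))   # 0
```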

We  used  a  DRAM  page  size  of  4K  in  the  simulation  of  page-mode  writes.  Page-mode  flush  is 
the  number  of  cycles  needed  to  flush  the  write  pipeline  after  a  series  of  page-mode  writes. 

TLB data is reported as (CPI - CPI of perfect memory subsystem¹⁰). This is the TLB
contribution to the CPI. This metric is used instead of just CPI to allow us to present the measurements
for all the benchmarks in one chart. A virtual memory page size of 4K was used in the simulations.
The penalty of a TLB miss is 28 cycles¹¹.


5  Results  and  Analysis 

In  Section  5.1  we  present  a  qualitative  analysis  of  the  memory  behavior  of  programs  compiled  with 
SML/NJ.  In  Section  5.2  we  list  the  cache  and  TLB  configurations  simulated  and  explain  why  they 
were  selected.  In  Sections  5.3,  5.4,  and  5.5  we  present  data  for  memory  subsystem  performance, 
write  buffer  performance,  and  TLB  performance.  In  Section  5.6  we  validate  the  simulations.  In 
Section  5.7  we  present  an  analytical  model  which  allows  us  to  extend  the  memory  subsystem 
performance  results  to  programs  with  different  allocation  behavior.  In  Section  5.8  we  summarize 
the  results. 


to  the  cache  after  the  miss  is  detected.  This  will  not  change  our  results  since  it  adds  at  most  0.02-0.05  to  the  CPI 
of  caches  with  subblock  placement. 

¹⁰The CPI of a perfect memory subsystem is the total number of instructions divided by the number of useful
instructions.

¹¹This is a weighted average of the various kinds of TLB misses under Mach 3.0 and is derived from the data in
[46].


Write Policy  Write Miss Policy  Subblocks  Assoc  Block Size    Cache Sizes  Write Buffer

through       allocate           yes        1, 2   16, 32 bytes  8K-512K      1-6 deep
through       allocate           no         1, 2   16, 32 bytes  8K-512K      6 deep
through       no allocate        no         1, 2   16, 32 bytes  8K-512K      6 deep

Table 6: Cache organizations studied


5.1  Qualitative  Analysis 

Recall from Section 2 that SML/NJ uses a copying collector. The most important property of a
copying collector with respect to memory subsystem behavior is that allocation initializes memory
in an area that has not been touched since the last garbage collection. This means that for caches
that are not large enough to contain the allocation area there will be a large number of write misses.
The slowdown that these write misses translate into depends on the memory subsystem organization.

Recall from Section 4.3 that SML/NJ programs have the following important properties. First,
they do few assignments: the majority of the writes are initializing writes. Second, programs do
heap allocation at a furious rate: 0.1 to 0.22 words per instruction. Third, writes come in bunches
because they correspond to initialization of a newly allocated area.

The  burstiness  of  writes  combined  with  the  property  of  copying  collectors  mentioned  above 
suggests  that  an  aggressive  write  policy  is  necessary.  In  particular,  writes  should  not  stall  the 
CPU. Memory subsystem organizations where the CPU has to wait for a write to be written
through (or back) to memory will perform poorly. Even memory subsystems where the CPU does
not need to wait for writes if they are issued far apart (e.g., 2 cycles apart in the HP 9000 series
700) may perform poorly due to the bunching of writes. This leads to two requirements on the
memory subsystem. First, a write buffer or fast page-mode writes are essential to avoid waiting
for writes to memory. Second, on a write miss, the memory subsystem must avoid reading a cache
block from memory if it is going to be written before being read. Of course, this requirement
only holds for caches with a write-allocate policy. Subblock placement [30], a block size of 1 word,
and the ALLOCATE instruction [38] can all achieve this. Since the effects on cache performance
of  these  features  are  so  similar,  we  talk  just  about  subblock  placement.  For  large  caches,  when 
the  allocation  area  fits  in  the  cache  and  thus  there  are  few  write  misses,  the  benefit  of  subblock 
placement  will  be  reduced. 
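
A back-of-the-envelope estimate shows why the second requirement matters. The sketch below assumes 4-byte words, an allocation area larger than the cache, and exactly one write miss for every newly allocated block; these are simplifying assumptions for illustration, and the function name is hypothetical. The result is in stall cycles per executed instruction, so it is not directly comparable to the per-useful-instruction CPI reported elsewhere.

```python
def write_miss_cycles_per_instruction(alloc_words_per_instr: float,
                                      block_bytes: int,
                                      miss_penalty: int,
                                      word_bytes: int = 4) -> float:
    """Estimated stall cycles per instruction caused by allocation write misses.

    Assumes sequential allocation into untouched memory, so each new cache
    block costs exactly one write miss.
    """
    new_blocks_per_instr = alloc_words_per_instr * word_bytes / block_bytes
    return new_blocks_per_instr * miss_penalty


# Allocation rates of 0.1 and 0.22 words per instruction, 16-byte blocks.
for rate in (0.1, 0.22):
    no_subblocks = write_miss_cycles_per_instruction(rate, 16, 15)
    subblocks = write_miss_cycles_per_instruction(rate, 16, 0)
    print(rate, round(no_subblocks, 3), subblocks)
```

Under these assumptions a 15-cycle write miss penalty adds roughly 0.4 to 0.8 cycles per instruction, while an organization that charges nothing for write misses adds none.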

5.2  Cache  and  TLB  configurations  simulated 

The  design  space  for  memory  subsystems  is  enormous.  There  are  many  variables  involved  and  the 
dependencies  between  them  are  complex.  Therefore  we  could  study  only  a  subset  of  the  memory 
subsystem  design  space.  In  this  study,  we  restrict  ourselves  to  features  found  in  currently  popular 
RISC  workstations.  Exploration  of  more  exotic  memory  subsystem  features  is  left  to  future  work 
(see  Section  6).  Table  6  summarizes  the  cache  organizations  simulated.  Table  7  lists  the  memory 
subsystem  organization  of  some  popular  machines. 

We simulated only separate instruction and data caches (i.e., no unified caches). While many
current machines have separate caches (e.g., DECStations, HP 700 series), there are some exceptions
(notably SPARCStations).

We  simulated  cache  sizes  of  8K  to  512K.  This  range  includes  the  primary  caches  of  most  current 
machines  (see  Table  7).  We  consider  only  one-way  (direct  mapped)  and  two-way  set  associative 
caches  (with  LRU  replacement). 

We simulated block sizes of 16 bytes and 32 bytes. Moreover, fetch size is kept the same as the
block size: in particular, in caches with subblock placement, a read miss brings in the whole block,
not just the subblock causing the miss. In effect, this is prefetching. Przybylski [30] notes that
making the fetch size equal to the block size is a good choice with respect to memory subsystem

Architecture          Write Policy  Write Miss Policy  Write Buffer  Subblocks  Assoc  Block Size  Cache Size

DS3100 [19]           through       allocate           4 deep        -          1      4 bytes     64K
DS5000/200 [18]       through       allocate           6 deep        yes        1      16 bytes    64K
HP 9000 [43]          back          allocate           none          no         1      32 bytes    64K-2M
SPARCStation II [17]  through       no allocate        1 deep        no         1      32 bytes    64K

•  SPARCStations have unified caches.

•  Most HP 9000 series 700 caches are much smaller than 2M: 128K instruction cache and 256K data cache for models 720
and 730, and 256K instruction cache and 256K data cache for model 750.

•  The DS5000/200 actually has a block size of four bytes with a fetch size of sixteen bytes. This is stronger than subblock
placement since it has a full tag on every "subblock".

•  The higher end HP 9000 machines (model 735 and above) provide a cache hint in their store instructions [I l].
The hint can specify that a block will be overwritten before being read; this avoids the read if the write misses. The
SML/NJ compiler may be able to extract much of the benefits of subblock placement from this feature.


Table  7:  Memory  subsystem  organization  of  some  popular  machines 

performance. Przybylski also notes that block sizes of 16 or 32 bytes optimize the read access
time for the memory parameters used in the CPI calculations (see Table 5). Hereafter, whenever
subblock placement is mentioned, it is assumed that the fetch size equals block size.

We report data only for write-through caches but the CPI for write-back caches can be inferred
from the graphs for write-through caches. While write-through and write-back caches have identical
misses, their contribution to the CPI may differ for two reasons. First, a write hit or miss in
a write-back cache may take one cycle more than in a write-through cache: unlike a write-through
cache, a write-back cache must probe the tag before writing to the cache [27]. The graphs for
write-through caches can be easily adjusted to account for this to obtain the graphs for write-back
caches. For instance, if the program has w writes and n useful instructions, then the CPI for a
write-back cache can be obtained by adding w/n to the CPI of the write-through cache with the
same size and configuration. For VLIW w/n is 0.18. Second, write-through and write-back caches
may have different write buffer penalties because they do writes to main memory with different
frequencies and at different points. We expect the write buffer penalties for write-back caches to
be smaller than those for write-through caches since writes to main memory are less frequent for
write-back caches than for write-through caches. This difference between write-through and
write-back caches is likely to be negligible since the write-buffer penalty is small even for write-through
caches.
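
The write-back adjustment can be checked against Table 3. The sketch below is a worked example, not part of the simulator: it takes the VLIW figures per 10,000 instructions executed (15.99% full-word writes, 9.04% nops), counts only full-word writes as w, and applies the adjustment to a write-through CPI of 1.5 that is purely hypothetical.

```python
def write_back_cpi(write_through_cpi: float, writes: int,
                   useful_instructions: int) -> float:
    """Write-back CPI estimated by adding one extra cycle per write (w/n)."""
    return write_through_cpi + writes / useful_instructions


# VLIW, per 10,000 instructions executed (percentages from Table 3).
writes = 1599            # 15.99% full-word writes
useful = 10_000 - 904    # 9.04% of the instructions are nops

print(round(writes / useful, 2))                      # 0.18
print(round(write_back_cpi(1.5, writes, useful), 2))  # 1.68 for a hypothetical CPI of 1.5
```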

We varied write buffer depths from 1 to 6 entries for write-through caches with the write
allocate/subblock placement organization. We also simulated memory subsystems with and without
page-mode writes.

We simulated fully associative, unified TLBs from 1 to 64 entries with LRU replacement policy.
Some machines (such as the HP 9000 series) have separate instruction and data TLBs. From
Section 5.5 it is clear that for the benchmarks even small unified TLBs perform well.

Two of the most important cache parameters are write allocate versus write no allocate and
subblock placement versus no subblock placement. Of these, the combination write no allocate/subblock
placement offers no improvement over write no allocate/no subblock placement for cache
performance. Thus, we did not collect data for the write no allocate/subblock placement
configuration.

We restrict ourselves only to the first two levels of the memory hierarchy, which on most current
machines corresponds to the primary cache and main memory. The results, however, are mostly
applicable when the second level is a secondary cache and the cost of accessing the secondary cache
is similar to the cost of accessing main memory in the DECStation 5000/200¹². In such machines,
there is a memory subsystem contribution to the CPI that we did not measure: a miss on the second
level cache. Therefore the CPI obtained on these machines can be higher than that reported here.

We did not simulate the exotic features appearing on some newer machines, such as stream
buffers, prefetching, scoreboarding, and victim caches. These features can reduce the number of
cache misses and miss costs. Further work is needed to understand the impact of these features on
the performance of heap allocation.

5.3  Memory  Subsystem  Performance 

We present memory subsystem performance in summary graphs and breakdown graphs. Each
summary graph summarizes the memory subsystem performance of one benchmark program for a
range of cache sizes (8K to 512K), write-miss policies (write allocate or write no allocate), subblock
placement (with or without), and associativity (1 or 2). Each curve in a summary graph corresponds
to a different memory subsystem organization. There are two summary graphs for each program,
one for a block size of 16 bytes and another for a block size of 32 bytes. Each breakdown graph
breaks down the memory subsystem overhead into read misses, write misses (if there is a penalty
for write misses), instruction fetch misses, write-buffer overhead, and partial-word write overhead
for one configuration in a summary graph. The write-buffer depth in these graphs is fixed at 6
entries.

In this section we present only the summary graphs for VLIW (Figure 2). The summary graphs
for other programs are similar and are given in Appendix A. Figures 3, 4, and 5 are the breakdown
graphs for VLIW for the 16 byte block size configurations: the remaining breakdown graphs for
VLIW are similar and omitted for conciseness. The breakdown graphs for the other benchmarks are
similar (and predictable from the summary graphs) and are thus omitted for the same reason¹³.

In the summary graphs, the nops curve is the base CPI: the total number of instructions
executed divided by the number of useful (not nop) instructions executed. This corresponds to
the CPI for a perfect memory subsystem¹⁴. For the breakdown graphs, the nop area is the CPI
contribution of nops; read miss is the CPI contribution of read misses; write miss is the CPI
contribution of write misses (if any); inst fetch miss is the CPI contribution of instruction fetch
misses; write buffer is the CPI contribution of the write buffer; partial word is the CPI contribution
of partial-word writes.

The 64K point on the write alloc, subblock, assoc=1 curves corresponds closely to the
DECStation 5000/200 memory subsystem.

In  the  following  subsections  we  describe  the  impact  of  write-miss  policy  and  subblock  placement, 
associativity,  block  size,  cache  size,  write  buffer,  and  partial-word  writes  on  the  memory  subsystem 
performance  of  the  benchmark  programs. 

5.3.1  Write  Miss  Policy  and  Subblock  Placement 

From the summary graphs, it is clear that the best cache organization we studied is write
allocate/subblock placement: it substantially outperforms all other configurations. Surprisingly, for
sufficiently large caches with the write allocate/subblock placement organization, the memory
subsystem performance of SML/NJ programs is acceptable: the overhead due to data cache misses
ranges from 3% to 13% (arithmetic mean 9%) for 64K direct mapped caches¹⁵ and 1% to 13%
(arithmetic mean 9%) for 32K two-way associative caches. The memory subsystem performance of


¹²For instance, Borg et al. [10] use 12 cycles as the latency for going to the second level cache and 200-250 cycles
for going to memory.

¹³The full set of graphs is available via anonymous ftp from ibis.cs.umass.edu in pub/memory-subsystem.

¹⁴Nops constitute between 5.9% and 15.4% of all instructions executed for the benchmarks (see Section 4.3).

¹⁵Recall that this corresponds to the DECStation 5000/200 memory subsystem.


SML/NJ programs on the DECStation 5000/200 is comparable to that of C and Fortran programs
[12]: Chen and Bershad find that the data cache overhead of C and Fortran programs ranges from
less than 1% to 66%, with an arithmetic mean of …%. It is worth emphasizing that the memory
subsystem performance of SML/NJ programs is good on some current machines despite the very
high miss rates: for a 64K write allocate/no subblock placement organization with a block size of 16
bytes, the write miss and read miss ratios for VLIW are 0.23 and 0.02 respectively.

Recall that in Section 5.1 we argued that the benefit of subblock placement would be substantial,
but that the benefit would decrease for larger caches. The summary graphs indicate that the
reduction in benefit is not substantial even for 128K cache sizes: however, the benefit of subblock
placement decreases sharply for larger caches for six of the benchmark programs. This suggests
that the allocation area size of six of the benchmark programs is 256K to 512K.

The performance of write allocate/no subblock is almost identical to that of write no allocate/no
subblock (Leroy is an exception) (footnote 17). This suggests that an address is being read soon after being
written; even in an 8K cache, an address is read after being written before it is evicted from the
cache (if it was evicted from the cache before being read, then write allocate/no subblock would
have inferior performance). The only difference between these two schemes is when a cache block
is read from memory. In one case, it is brought in on a write miss; in the other, it is brought in
on a read miss. Because SML/NJ programs allocate sequentially and do few assignments, a newly
allocated object remains in the cache until the program has allocated another C bytes, where C is
the size of the cache. Since the programs allocate 0.4-0.9 bytes per instruction, our results suggest
that a read of a block occurs within 9K-20K instructions of its being written.
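
The arithmetic behind this window can be sketched as follows. This is only an illustration: the function name is hypothetical, and the allocation rates of 0.4 and 0.9 bytes per instruction are assumed as the endpoints that reproduce the quoted 9K to 20K range for an 8K cache.

```python
def instructions_until_eviction(cache_bytes, bytes_per_instruction):
    """Instructions executed while a program sequentially allocates
    `cache_bytes` bytes, i.e. roughly how long a newly allocated object
    survives in a cache of that size (hypothetical helper)."""
    return cache_bytes / bytes_per_instruction


CACHE_BYTES = 8 * 1024  # the 8K cache discussed in the text

# Assumed endpoints of the allocation rate, in bytes per instruction.
shortest_window = instructions_until_eviction(CACHE_BYTES, 0.9)
longest_window = instructions_until_eviction(CACHE_BYTES, 0.4)

print(f"{shortest_window:.0f} to {longest_window:.0f} instructions")
```

Under these assumptions the window is roughly 9K to 20K instructions, matching the range given above.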

5.3.2  Changing  Associativity 

From Figure 2 we see that increasing associativity improves all organizations. However, the improvement
in going from one-way to two-way set associativity is much smaller than the improvement
obtained from subblock placement; in most cases, it improves the CPI by less than 0.1. The
maximum benefit from higher associativity is obtained for small cache sizes (less than 16K). However,
increasing associativity may increase CPU cycle time and thus the improvements may not be
realized in practice [25].

From Figures 3, 4, and 5 we see that higher associativity improves the instruction cache performance
but has little or no impact on data cache performance. Surprisingly, for direct mapped
caches (Figures 3 (a), 4 (a), and 5 (a)) the instruction cache penalty is substantial for 128K or
smaller caches. For caches with subblock placement, the instruction cache penalty can dominate
the penalty for the memory subsystem. The improvement observed in going to a two-way associative
cache suggests that much of the penalty from the instruction cache is due to conflict misses
and that from the data cache is due to capacity misses: the data cache is simply not big enough
to hold the working set. When the benchmark programs are examined, the performance of the
instruction cache is not surprising: the code consists of small functions with frequent calls, which
lowers the spatial locality. Thus, the chances of conflicts are greater than if the instructions had
strong spatial locality.

5.3.3  Changing  Block  Size 

From Figure 2 we see that increasing block size from 16 to 32 bytes also improves performance.
For the write allocate organizations, an increased block size decreases the number of write misses
caused by allocation. When the allocation area does not fit in the cache, doubling the block size can
halve the write-miss rate. Thus, larger block sizes improve performance when there is a penalty


16 Chen and Bershad use Cycles/Instruction rather than Cycles/Useful Instruction, which lowers their memory
subsystem overhead.

17 The difference between write allocate/no subblock and write no allocate/no subblock is so small in most graphs
that the two curves overlap.


for a write miss [30]. In particular, larger block sizes have little to offer to caches with write
allocate/subblock placement. From Figure 2 we see that the write no allocate organizations benefit
just as much from larger block size as write allocate/no subblock placement; this suggests that the
spatial locality of the reads is comparable to that of the writes.

Note  that  subblock  placement  improves  performance  more  than  even  two-way  associativity  and 
32  byte  blocks  combined. 

5.3.4  Changing  Cache  Size 

Three distinct regions of performance can be identified for cache sizes. The first region corresponds
to the range of cache sizes when the allocation area does not fit in the cache (i.e., allocation
happens in an area of memory which is not cache resident). For most of the benchmarks, this
region corresponds to cache sizes of less than 256K (for Simple and Knuth-Bendix this region
extends beyond 512K). In this region, increasing the cache size uniformly improves performance
for all configurations. However, the performance improvement from doubling the cache size is small.

From  the  breakdown  graphs  we  see  that  in  the  first  region  the  cache  size  has  little  effect  on  the 
data  cache  miss  contribution  to  CPI.  Most  of  the  improvement  in  CPI  that  comes  from  increasing 
the  cache  size  is  due  to  improved  performance  of  the  instruction  cache.  As  with  associativity,  cache 
sizes  have  interactions  with  the  cycle  time  of  the  CPU:  larger  caches  can  take  longer  to  access. 
Thus,  improvement  due  to  increasing  the  cache  size  may  not  be  achieved  in  practice. 

The second region ranges from when the allocation area begins to fit in the cache until the
allocation area fits in the cache completely. For most of the benchmarks (once again excepting Simple and
Knuth-Bendix), this region corresponds to cache sizes in the range 256K to 512K (footnote 18). In this region,
increasing the cache size sharply improves the data cache performance for memory organizations
without subblock placement. However, increasing the cache size in this region has little to offer
for instruction cache performance because the instruction cache miss penalty is already low at this
point.

The third region corresponds to cache sizes when the allocation area fits in the cache. For five
of the benchmarks, this region corresponds to caches larger than 512K (for Lexgen, Knuth-Bendix,
and Simple this region starts at larger cache sizes). In this range, increasing the cache size has
little or no impact on memory subsystem performance because everything remains cache resident
and thus there are no capacity misses to eliminate.

5.3.5  Write Buffer and Partial-Word Write Overheads

From  the  breakdown  graphs  we  see  that  the  write  buffer  and  partial  word  write  contribution  to  the 
CPI  is  negligible.  A  six  deep  write  buffer  coupled  with  page-mode  writes  is  sufficient  to  absorb  the 
bursty  writes.  As  expected,  memory  subsystem  features  which  reduce  the  number  of  misses  (such 
as  higher  associativity  and  larger  cache  sizes)  also  reduce  the  write  buffer  overhead. 


18 For Lexgen this region extends a little beyond 512K.
5.4  Write-buffer depth

In Section 5.3.5 we showed that a six-deep write buffer coupled with page-mode writes was able to
absorb the bursty writes in SML/NJ programs. In this section we explore the impact of write buffer
depth on the write-buffer contribution to CPI. Since the speed at which the write buffer can retire
writes depends on whether or not the memory subsystem has page-mode writes, we conducted two
sets of experiments. In the first set, we simulated a memory subsystem with page-mode writes and
varied the write-buffer depth from 1 to 6. In the second set, we simulated a memory subsystem
without page-mode writes and varied the write-buffer depth from 1 to 6. We conducted this study
for two of the larger benchmarks: CW and VLIW. We fixed the block size at 16 bytes and the write
miss policy at write allocate/subblock placement.

Figure 6 gives the write buffer overheads for VLIW with caches of associativity one and two
in a memory subsystem with page-mode writes; Figure 7 does the same in a memory subsystem
without page-mode writes. The graphs plot the CPI contribution of the write buffer against
cache size; there is one curve for each write-buffer depth. (Graphs for CW are omitted for space
considerations.) Increasing the cache size or associativity reduces the number of read and instruction
fetch misses, and thus reduces the number of main memory transactions. This reduces the write-buffer
contribution to the CPI in four ways:

1. The write buffer has more cycles to retire its entries and hence the write buffer full stalls
occur less frequently (footnote 19).

2. In the memory subsystem with page-mode writes, the main memory is thrown out of page
mode less frequently, allowing the write buffer to retire writes quickly (footnote 20). This reduces the
write buffer full stalls.

3. Since there are fewer reads to main memory, the number of times a read to main memory
needs to wait for a write to finish is less, thus reducing the main memory busy delays.

4. Since there are fewer reads to main memory, a read to main memory conflicts with a write
buffer entry less frequently, thus reducing the write buffer conflict delays.

In memory subsystems with page-mode writes (Figure 6), the difference between the CPI contribution
of a one-deep write buffer and a six-deep write buffer is less than 0.05. This is surprisingly
small considering the burstiness of the writes. This is due to the effectiveness of page-mode writes;
an example illustrates this.

Suppose  that  a  SML/NJ  program  is  allocating  (and  initializing)  an  object  which  is  t  words 
in  size  and  that  the  write  buffer  is  one  deep.  Further  suppose  that  the  write  buffer  is  empty  and 
that  the  instructions  doing  the  allocation  all  hit  in  the  instruction  cache.  The  first  write  does  not 
stall  the  CPU  since  the  write  buffer  is  empty.  The  next  write  comes  one  cycle  later,  finds  a  full 
write buffer, and thus stalls the CPU. After 4 cycles (see penalties in Table 5), the write is queued
up in the write buffer. This write, however, is highly likely to be on the same DRAM page as the
previous  write  (since  it  is  to  the  next  address)  and  will  therefore  take  only  one  cycle  to  complete. 
All  subsequent  writes  to  initialize  this  object  find  an  empty  write  buffer  since  they  all  complete  in 
one  cycle  due  to  page-mode  writes. 

As  noted  above,  all  the  writes  to  initialize  an  object  are  likely  to  be  on  the  same  page  and  can 
thus  take  advantage  of  page-mode  writes.  Due  to  sequential  allocation,  it  is  likely  that  writes  to 
initialize  objects  allocated  one  after  another  will  also  be  on  the  same  DRAM  page.  Thus,  in  the 
best  case  (with  no  read  misses  and  refreshes),  a  write  buffer  full  delay  will  happen  only  once  per 
N  words  of  allocation,  where  N  is  the  size  of  the  DRAM  page.  Thus,  the  write  buffer  depth  has 
little  performance  impact  on  SML/NJ  programs  if  the  memory  subsystem  has  page-mode  writes. 
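
The example above can be turned into a small cycle-counting sketch. It is a simplification, not the simulator used in this study: it models only a burst of initializing writes (no reads, instruction fetches, or refreshes), it assumes one write is issued per cycle, and it assumes a write that opens a new DRAM page takes 5 cycles while a page-mode write takes 1 cycle. The 5-cycle figure is an assumption chosen so that the second write stalls for 4 cycles, as in the example; all names are hypothetical.

```python
from collections import deque


def write_buffer_stalls(addresses, depth, page_words, slow=5, fast=1):
    """Count CPU stall cycles caused by a full write buffer.

    addresses  -- word addresses written, one write issued per cycle
    depth      -- number of entries in the write buffer
    page_words -- DRAM page size in words
    slow/fast  -- assumed cycles for a write to a new page / the open page
    """
    pending = deque()   # completion times of writes still in the buffer
    now = 0             # current cycle
    memory_free_at = 0  # cycle at which memory finishes its current write
    last_page = None
    stalls = 0
    for addr in addresses:
        # Drop entries that memory has already retired.
        while pending and pending[0] <= now:
            pending.popleft()
        # A full buffer stalls the CPU until the oldest entry retires.
        if len(pending) == depth:
            stalls += pending[0] - now
            now = pending.popleft()
        page = addr // page_words
        cost = fast if page == last_page else slow
        last_page = page
        # Memory retires buffered writes one at a time, in order.
        memory_free_at = max(now, memory_free_at) + cost
        pending.append(memory_free_at)
        now += 1
    return stalls


burst = range(8)  # eight consecutive words on one DRAM page
with_page_mode = write_buffer_stalls(burst, depth=1, page_words=1024)
without_page_mode = write_buffer_stalls(burst, depth=1, page_words=1024, fast=5)
print(with_page_mode, without_page_mode)
```

Under these assumptions a one-deep buffer stalls only once (for 4 cycles) when page-mode writes are available, but stalls on every write after the first when they are not, which is the qualitative behaviour described in this section.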


19 Recall that a write buffer uses free memory cycles to retire its writes.
20 Recall that reads throw main memory out of page mode.


To  confirm  this  explanation,  we  measured  the  probability  of  two  consecutive  writes  being  on  the 
same  DRAM  page.  This  probability  (averaged  over  the  benchmarks)  was  90%. 
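
A measurement of this kind can be sketched as follows. The trace format (a list of byte addresses of successive writes) and the function name are assumptions for illustration; the study's own instrumentation is not shown in this section.

```python
def same_page_fraction(write_addresses, page_bytes):
    """Fraction of consecutive write pairs that fall on the same DRAM page.

    write_addresses -- byte addresses of successive writes (assumed format)
    page_bytes      -- DRAM page size in bytes
    """
    pages = [addr // page_bytes for addr in write_addresses]
    pairs = list(zip(pages, pages[1:]))
    if not pairs:
        return 0.0
    return sum(first == second for first, second in pairs) / len(pairs)


# Four sequential word writes on one page, then two on the next page.
trace = [0, 4, 8, 12, 4096, 4100]
print(same_page_fraction(trace, page_bytes=4096))
```

For this toy trace, four of the five consecutive pairs stay on the same page.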

The  small  impact  of  write  buffer  depth  on  performance  does  not  imply  that  a  write  buffer  is 
useless  if  the  memory  system  has  page-mode  writes.  Instead,  it  says  that  a  write  buffer  offers  little 
performance  improvement  in  a  memory  subsystem  with  page-mode  writes  if  the  programs  have 
strong  spatial  locality  in  the  writes,  and  the  majority  of  the  reads  and  instruction  fetches  hit  in 
the  cache.  Strong  spatial  locality  means  that  the  probability  that  two  consecutive  writes  are  to  the 
same  DRAM  page  is  very  high. 

Write-buffer depth is, however, important if the memory subsystem does not have page-mode
writes (Figure 7). A six-deep write buffer performs substantially better than a one-deep write
buffer in a memory system without page-mode writes.

5.5  TLB  Performance 

Figure 8 gives the TLB miss contribution to the CPI for each benchmark program. We see that the
CPI contribution of TLB misses falls below 0.01 for all our programs for a 64 entry unified TLB;
for half the benchmarks, it is below 0.01 even for a 32 entry TLB.

5.6  Validation 

To validate our simulations, we ran each of the benchmarks five times on a DECStation 5000/200
(running Mach 2.6) and measured the user time for each run. The programs were run on a
lightly loaded machine but not in single-user mode. The simulations with write allocate/subblock
placement, 64K direct-mapped caches, 16 byte blocks, and a 64 entry TLB correspond closely to the
DECStation 5000/200 with the following important differences:

•  The simulations ignored the effects of context switches and system calls. Thus, actual program
runs suffered more data and instruction cache misses than those reported by the simulations
[36].

•  The simulations assumed a virtual address = physical address mapping. Kessler and Hill [29]
show  that  random  mapping  (as  used  in  the  actual  runs)  can  have  many  more  conflict  misses 
than  a  careful  mapping  (such  as  that  assumed  by  the  simulations).  Thus,  the  actual  runs 
probably  suffered  more  conflict  misses  than  those  reported  by  the  simulations. 

•  The  simulations  assumed  that  all  instructions  take  exactly  one  cycle  (plus  memory  subsystem 
overhead). Some of the benchmarks do multiplications and divisions (both of which take more
than  one  cycle).  Thus,  the  actual  program  runs  may  take  more  cycles  to  complete  than  the 
cycles  predicted  by  the  simulations. 

In order to minimize the memory subsystem effects of the virtual to physical mapping and
context switches, we took the minimum CPI of the five runs for each program and compared it
to the CPI obtained via simulations. We present our findings in Table 8: Measured (sec) is the
user time of the program in seconds; Measured CPI is the CPI obtained from the measured time;
Simulated CPI is the CPI obtained from the simulations; Difference is the difference between the
measured CPI and the simulated CPI; Discrepancy is the difference as a percentage of measured
CPI.

Table 8 shows that with the exception of PIA and VLIW, the discrepancy is small (i.e., less
than 10%); the actual runs validate the simulations. The discrepancy in PIA and VLIW is due
to the significant number of multi-cycle instructions they execute (footnote 21). Table 9 lists the multi-cycle
instructions executed by each program (footnote 22). Total is the percentage of instructions which are divisions,


21 In this section, multi-cycle instructions refer to integer multiplication and division, and floating point operations.
22 SML/NJ uses only the "double" versions of each floating point instruction.
Figure 8: TLB contribution to CPI


Program        Measured (sec)   Measured CPI   Simulated CPI   Difference   Discrepancy (%)
CW                  25.83           1.42            1.39           0.03            2.18
Knuth-Bendix        14.95           1.27            1.21                           5.22
Lexgen              16.13           1.40            1.31           0.09            6.29
Life                17.16           1.23            1.21           0.02            1.19
PIA                  6.41           1.43            1.18           0.25           17.62
Simple              >9.81           1.33            1.21           0.12            9.03
VLIW                25.61           1.76            1.39           0.37           20.77
YACC                 6.58           1.39            1.36           0.03            2.20

Table 8: Measured versus Simulated

multiplications, floating point additions, or floating point subtractions; I Div and I Mul are the
percentages of integer division and multiplication respectively; F Add, F Sub, F Div, F Mul are the
percentages of floating point additions, subtractions, divisions, and multiplications respectively.

The actual impact of multi-cycle instructions on CPI can be determined only by simulations.
This is because on a DECStation 5000/200, the CPU does not need to wait after issuing a multi-cycle
instruction. However, if the CPU tries to read the result of a multi-cycle instruction, it
stalls until that instruction is complete. Moreover, the number of cycles needed for a floating
point instruction depends on what other operations are currently in progress in the floating point
coprocessor. Table 10 gives the latencies (in cycles) for the different multi-cycle instructions. The
cycles for the floating point multiplication and division are lower bounds.

To test whether multi-cycle instructions could explain the high discrepancies in PIA and VLIW,
we added the overhead of multi-cycle instructions to the simulated CPI assuming that all multi-cycle
instructions stalled the CPU for the cycles listed in Table 10. This yielded a simulated CPI
of 1.41 for PIA and 1.59 for VLIW. This reduced the discrepancy to 1.4% for PIA and 9.7% for VLIW.
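
The adjustment can be reproduced with a short calculation, sketched here under stated assumptions. The instruction mix is taken from Table 9 and the CPIs from Table 8. Two values are assumptions rather than readings from the tables: a floating point multiplication cost of 4 cycles (the value that reproduces the adjusted CPI quoted for PIA) and an integer multiplication share of 0.63% for VLIW (Total minus I Div, assuming VLIW executes no floating point operations). All names are hypothetical.

```python
# Stall cycles per instruction; the f_mul value of 4 is an assumption.
LATENCY = {"i_div": 36, "i_mul": 13, "f_add": 1, "f_sub": 1, "f_div": 18, "f_mul": 4}

# Percentage of executed instructions of each kind.
# The VLIW i_mul entry is inferred (Total - I Div), not read from the table.
MIX = {
    "PIA": {"f_add": 1.30, "f_sub": 0.38, "f_div": 0.84, "f_mul": 1.56},
    "VLIW": {"i_div": 0.32, "i_mul": 0.63},
}

SIMULATED_CPI = {"PIA": 1.18, "VLIW": 1.39}
MEASURED_CPI = {"PIA": 1.43, "VLIW": 1.76}


def adjusted_cpi(program):
    """Simulated CPI plus the stall overhead of multi-cycle instructions,
    assuming every such instruction stalls for its full latency."""
    overhead = sum(pct / 100 * LATENCY[kind] for kind, pct in MIX[program].items())
    return round(SIMULATED_CPI[program] + overhead, 2)


def discrepancy_percent(program):
    """Remaining gap between measured and adjusted CPI, as a percentage
    of the measured CPI."""
    measured = MEASURED_CPI[program]
    return 100 * (measured - adjusted_cpi(program)) / measured


for name in ("PIA", "VLIW"):
    print(name, adjusted_cpi(name), round(discrepancy_percent(name), 1))
```

With these inputs the sketch gives adjusted CPIs of 1.41 and 1.59 and discrepancies of 1.4% and 9.7%, in agreement with the figures above.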

On examining the assembly code generated for PIA, we found that the distance between multi-cycle
instructions and use of their results varied significantly. Moreover, in many instances the
assembly code had bunches of multiplications and divisions; these cause resource conflicts in the
floating-point coprocessor, thus causing them to have longer latencies than those in Table 10. Therefore,
without simulating multi-cycle instructions, we cannot determine their exact penalty in PIA.
Program        Total   I Div   I Mul   F Add   F Sub   F Div   F Mul
CW              0.00    0.00    0.00    0.00    0.00    0.00    0.00
Knuth-Bendix    0.00    0.00    0.00    0.00    0.00    0.00    0.00
Lexgen          0.04    0.02                            0.00    0.00
Life            0.00    0.00    0.00    0.00    0.00    0.00    0.00
PIA             4.08    0.00    0.00    1.30    0.38    0.84    1.56
Simple          1.67    0.00    0.50    0.30    0.14    0.06    0.67
VLIW            0.95    0.32                            0.00    0.00
YACC            0.01    0.01                            0.00    0.00

Table 9: Multi-cycle instructions as a percentage of instruction count


                 Integer   Floating Point
Multiplication      13            4
Division            36           18
Addition             -            1
Subtraction          -            1

Table 10: Multi-cycle instruction cost on a DECStation 5000/200


However,  a  simple  calculation  shows  that  even  if  each  multi-cycle  instruction  stalls  the  CPU  for  half 
the  time  reported  in  Table  10,  the  discrepancy  falls  well  below  10%.  Thus,  multi-cycle  instructions 
can  explain  the  discrepancy  for  PIA. 

From  profiling  VLIW  we  found  that  the  vast  majority  of  the  multi-cycle  instructions  came  from 
one routine, mod, in the SML/NJ standard library. On examining the assembly code for mod, we
found that the results of the multiplications were used immediately, and the results of the divisions
were used either immediately or one instruction later. Each multiplication therefore stalled the CPU
for 13 cycles and each division stalled the CPU for 35 (footnote 23) or 36 cycles, so it is reasonable to use
the numbers in Table 10 to compute the CPI overhead of multi-cycle instructions. Thus, multi-cycle
instructions can explain the discrepancy for VLIW.

5.7  Extending  the  results 

Section  5.3  demonstrated  that  heap  allocation  can  have  a  significant  memory  subsystem  cost  if  it 
is  not  possible  to  allocate  a  new  object  directly  into  the  cache.  In  this  section,  we  present  and 
evaluate  an  analytic  model  which  predicts  the  memory  subsystem  cost  due  to  heap  allocation  when 
this  is  the  case.  This  model  formalizes  the  intuition  presented  in  Section  5.1.  It  allows  us  to  predict 
the  memory  subsystem  cost  due  to  heap  allocation  when  block  sizes,  miss  penalties,  or  program 
heap  allocation  rates  change.  We  use  the  model  to  speculate  about  the  memory  subsystem  cost  of 
heap allocation for caches without subblock placement if SML/NJ were to use a simple stack.


23 Assuming the instruction (always arithmetic) between the division and use of its result hits in the cache.
5.7.1  An analytic model

Recall that heap allocation with copying garbage collection typically allocates memory which has
not been touched in a long time, and thus is unlikely to be in the cache. This is especially true when
the allocation area does not fit in the cache. Thus, when newly allocated memory is initialized, write
misses occur. The rate of write misses depends upon the allocation rate and the block size. Given
the rate of write misses, we can calculate the memory subsystem cost, C, due to heap allocation.

a  = allocation rate (words/useful instruction)
b  = block size (words)
rp = read miss penalty (cycles)
wp = write miss penalty (cycles)

Then, under the assumption that the allocation area does not fit in the cache, i.e., that initializing
writes miss,

$$C_{\text{write alloc}} = \mathit{wp} \cdot a / b$$

The cost of allocating one word on the heap, A, will be

$$A_{\text{write alloc}} = \mathit{wp} / b$$

Note that depending on the cache organization, the write miss penalty may be 0.

Under the additional assumption that programs touch allocated data soon after it is allocated,

$$C_{\text{write no alloc}} = \mathit{rp} \cdot a / b$$

$$A_{\text{write no alloc}} = \mathit{rp} / b$$
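
The model is small enough to state as code. The sketch below is illustrative only; the function names are hypothetical, and the example assumes 4-byte words, so that a 16-byte block is 4 words.

```python
def heap_allocation_cost(alloc_rate, block_words, miss_penalty):
    """Memory subsystem cost C, in cycles per useful instruction.

    alloc_rate   -- a, words allocated per useful instruction
    block_words  -- b, cache block size in words
    miss_penalty -- wp for write allocate, rp for write no allocate (cycles)
    """
    return miss_penalty * alloc_rate / block_words


def cost_per_word(block_words, miss_penalty):
    """Cost A of allocating one word on the heap, in cycles."""
    return miss_penalty / block_words


# Example: 16-byte blocks (4 words, assuming 4-byte words), a 15-cycle
# miss penalty, and an allocation rate of 0.14 words per useful instruction.
print(heap_allocation_cost(0.14, block_words=4, miss_penalty=15))
print(cost_per_word(block_words=4, miss_penalty=15))
```

Because the cost is inversely proportional to the block size, doubling the block size halves it, and a write miss penalty of 0 makes the write allocate cost vanish.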

The  cost  of  heap  allocation  should  account  for  the  difference  in  simulated  CPIs  when  the  write 
miss  policy  is  varied  for  the  SML/NJ  benchmarks,  since  the  benchmarks  do  so  few  assignments. 
That  is, 

$$C_{\text{write alloc/no subblock}} \approx CPI_{\text{write alloc/no subblock}} - CPI_{\text{write alloc/subblock}}$$

$$C_{\text{write no alloc/no subblock}} \approx CPI_{\text{write no alloc/no subblock}} - CPI_{\text{write alloc/subblock}}$$

Table 11 shows the average percentage difference between the cost of heap allocation, C, and the
differences in the CPIs. The percentage difference for write allocate/no subblock, D, was calculated
as

$$CPI_{\text{diff}} = CPI_{\text{write alloc/no subblock}} - CPI_{\text{write alloc/subblock}}$$

$$D_{\text{write alloc/no subblock}} = \frac{C_{\text{write alloc/no subblock}} - CPI_{\text{diff}}}{CPI_{\text{diff}}}$$


The percentage difference for write no alloc/no subblock was calculated similarly. We fixed
the block size to be 16 bytes. Recall that the miss penalties are wp = rp = 15. We calculated
the allocation rates (Table 12) for programs by using the allocation information from Table I and
instruction counts from Table 3. The average was the arithmetic mean. The average difference
when the allocation area does not fit in the cache (128K or less) is small (2-32%). When the
assumption that the allocation area does not fit in the cache is violated, the model is inaccurate, as
expected. The percentage difference heads towards infinity as CPI_diff becomes very small. Thus,
this model can be used to predict the memory subsystem cost of heap allocation only for small
cache sizes.


Cache size     D (write no alloc/no subblock)   D (write alloc/no subblock)
(Kilobytes)                 (%)                             (%)
8K                          7.12                            2.1
16K                         6.84                            2.2
32K                         7.02                            2.2
64K                        10.8                             5.7
128K                       31.8                            24.5
256K                      128.8                           111.1
512K                     1847.7                          1746.2

Table 11: Percent difference between analytical model and simulations


Program        Allocation rate                Allocation rate
               including callee-save conts.   excluding callee-save conts.
               (words/useful instruction)     (words/useful instruction)
CW                       0.12                           0.04
Knuth-Bendix             0.24                           0.12
Lexgen                   0.11                           0.04
Life                     0.11                           0.02
PIA                      0.17                           0.14
Simple                   0.14                           0.05
VLIW                     0.16                           0.06
YACC                     0.14                           0.07
Median                   0.14                           0.05

Table 12: Allocation rate for benchmarks, including and excluding callee-save continuations, which
can be stack-allocated.
Program        C (cycles/instruction)
CW                     0.15
Knuth-Bendix           0.44
Lexgen                 0.12
Life                   0.09
PIA                    0.17
Simple                 0.17
VLIW                   0.23
YACC                   0.24

Table 13: Assuming procedure activation records are stack allocated in SML/NJ, this table presents
the expected memory subsystem cost of heap allocation for caches without subblock placement


5.7.2  SML/NJ  with  a  stack 

We can use this model to speculate about the memory subsystem cost of heap allocation in SML/NJ
when a stack is used. In the absence of first-class continuations, which the benchmarks do not use,
callee-save continuations can be easily stack-allocated. The callee-save continuations correspond to
procedure activation records. Table 12 shows that stack-allocating callee-save continuations would
greatly reduce the allocation rate of the benchmarks.

Assuming only continuations are stack-allocated, Table 13 presents an estimate of the memory
subsystem cost of heap allocation for caches that do not have subblock placement and are too small
to hold the allocation area. The block size is 16 bytes, the read miss penalty 15 cycles, and the
write miss penalty for the no-subblock caches 15 cycles.

This  is  an  upper  bound  estimate  of  expected  memory  subsystem  cost  of  heap  allocation  with 
a  stack  because  it  may  be  possible  to  stack-allocate  additional  objects  [31].  We  see  that  even  with 
a  simple  stack,  the  memory  subsystem  costs  due  to  heap  allocation  for  caches  without  subblock 
placement  will  probably  be  significant  for  SML/NJ  programs. 
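
As a rough illustration of this estimate, the sketch below applies the model of Section 5.7.1 to the allocation rates of two benchmarks from Table 12, with and without callee-save continuations. The function name is hypothetical, and 4-byte words are assumed so that a 16-byte block is 4 words; because the table entries are rounded, the results only approximate those in Table 13.

```python
BLOCK_WORDS = 4    # 16-byte blocks, assuming 4-byte words
MISS_PENALTY = 15  # cycles, for both read and write misses


def heap_allocation_cost(alloc_rate):
    """Model cost C = penalty * a / b, in cycles per useful instruction."""
    return MISS_PENALTY * alloc_rate / BLOCK_WORDS


# (rate including, rate excluding) callee-save continuations, from Table 12.
RATES = {"CW": (0.12, 0.04), "VLIW": (0.16, 0.06)}

for program, (heap_only, with_stack) in RATES.items():
    print(program,
          round(heap_allocation_cost(heap_only), 3),
          round(heap_allocation_cost(with_stack), 3))
```

For CW the estimated cost drops from about 0.45 to 0.15 cycles per useful instruction, and for VLIW from about 0.60 to 0.23: a large reduction, but a cost that remains noticeable relative to a base CPI near 1.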

5.8  Summary  of  Results 

Contrary to what other researchers have speculated, we have found that the memory subsystem
performance of SML/NJ is quite good on some real machines. Of the cache organization parameters
we studied, write allocate/subblock placement with a subblock size of 1 word is most important
for good performance of SML/NJ programs. However, small caches perform badly for all cache
organizations. Also, DECStations are the only machines whose caches have subblock placement
with a subblock size of 1 word; thus, the memory subsystem performance of SML/NJ programs is
bad on most current machines.

Higher associativity and larger block sizes also improve performance but the improvement is not
as significant as that offered by subblock placement. Larger cache sizes also improve performance,
but for cache sizes up to 128K the improvement is small. For six of the benchmarks, increasing the
cache sizes beyond 128K allows the allocation area to fit in the cache; thus increasing the cache
size beyond 128K can be profitable.

Most surprisingly, higher associativity and larger cache sizes (up to 128K) have little effect on
the performance of the data cache: most of the overall improvement observed is in the instruction
cache. The bad locality of the instructions due to small functions and frequent calls leads to many
conflict misses in the instruction cache, which can be alleviated by going to a larger cache size or
higher associativity.
We found fast page mode writes to be very effective in absorbing the bursty writes of SML/NJ
programs. In memory subsystems with page-mode writes, the write-buffer depth was not important:
a one-deep write buffer performed almost as well as a six-deep write buffer. In memory subsystems
without page-mode writes, the write-buffer depth was important: a one-deep write buffer performed
much worse than a six-deep write buffer.

Finally, we found the penalty due to TLB misses to be small for TLBs with 32 or more entries.


6  Future  Work 

We  suggest  three  directions  in  which  this  study  can  be  extended: 

•  measuring the impact of other architectural features not explored in this work,

•  measuring  the  impact  of  different  compilation  techniques,  and 

•  measuring  other  aspects  of  programs. 

Regarding architectural features, there is a need to explore memory subsystem performance
of heap allocation on newer machines. As CPUs get faster relative to main memory, memory
subsystem performance becomes even more crucial to good performance. To address the increasing
discrepancy between CPU speeds and main memory speeds, newer machines, such as Alpha
workstations [20], often have features such as secondary caches, stream buffers, and register
scoreboarding.

Secondary caches improve performance by reducing accesses to main memory. Stream buffers
and scoreboarding improve performance by reducing the latency of cache misses. The impact of
these features on memory subsystem performance can be determined only by simulations. Previous
work has addressed each of these features in isolation: Short and Levy [42], Borg et al. [10],
and Przybylski [39] study two-level caches, Jouppi [26] studies stream buffers, and Chen and Baer
[13] study scoreboarding. However, we are not aware of any published work which has studied a
memory subsystem with all (or a combination) of these features. Also, we are not aware of any
work evaluating the impact of these features on heap allocation.

Regarding different compilation techniques, the impact of stack allocation is worth measuring.
A stack reduces heap allocation (which performs badly on most memory subsystem organizations)
in favor of stack allocation (which can have good cache locality since it focuses most of the references
to a small part of memory, namely the top of the stack). For SML/NJ programs, the majority
of heap allocated objects can be allocated on the stack (Table I). Therefore stack allocation
can substantially improve performance of SML/NJ programs on memory organizations without
subblock placement or with small cache sizes. However, stack allocation can slow down exceptions,
first-class continuations, and threads. A careful study is needed to evaluate the pros and cons of
doing stack allocation. We are currently working on this.

Regarding  measuring  other  aspects  of  programs,  several  areas  seem  promising  for  future  work: 

1. Measuring the impact of different garbage collection algorithms on cache performance. Some
work  has  already  been  done  on  this  but  more  needs  to  be  done  (see  Section  3). 

2. Measuring the impact of changing various garbage collector parameters (such as allocation
area  size)  on  cache  performance.  We  are  currently  working  on  this. 

3.  Measuring  the  cost  of  various  operations  related  to  garbage  collection:  tagging,  store  checks, 
and  garbage  collection  checks.  A  preliminary  study  of  this  is  reported  in  [  lb]. 

4. Measuring the impact of optimizations on cache performance. Of special interest here is the
effect of function inlining. We are currently working on this.

7  Conclusions


We  have  studied  the  memory  subsystem  performance  of  heap  allocation  with  copying  garbage 
collection,  a  general  automatic  storage  management  technique  for  modern  programming  languages. 
Heap  allocation  is  useful  for  implementing  language  features  such  as  list-processing,  higher-order 
functions,  and  first-class  continuations  where  objects  may  have  indefinite  extent.  However,  heap 
allocation is widely believed to have poor memory subsystem performance [38, 48, 49, 50]. This
belief  is  based  on  the  high  (write)  miss  ratios  that  occur  when  new  objects  are  allocated  and 
initialized. 

We studied the memory subsystem performance of mostly-functional SML programs compiled
with the SML/NJ compiler. These programs heap allocate at intensive rates. They use heap-only
allocation:  all  allocation,  including  activation  records,  is  done  on  the  heap.  We  simulated  a  wide 
variety  of  memory  subsystems  typical  of  current  workstations. 

To  our  surprise,  we  found  that  heap  allocation  performed  well  on  some  memory  subsystems.  In 
particular, on an actual machine (the DECStation 5000/200), the memory subsystem performance
of heap allocation was good. However, heap allocation performed poorly on most memory subsystem
organizations. The memory subsystem property crucial for achieving good performance was
the ability to allocate and initialize a new object into the cache without a penalty. This can be
achieved  by  having  subblock  placement  or  a  cache  large  enough  to  hold  the  allocation  area,  along 
with  fast  page-mode  writes  or  a  sufficiently  deep  write  buffer. 
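
The effect of subblock placement on allocation can be illustrated with a minimal sketch. It is not the simulator used in this study: it models only a direct-mapped, write-allocate data cache receiving sequential initializing writes, and it counts how many write misses force a block fetch from main memory. The function name, parameters, and default sizes are assumptions chosen for illustration. Write buffers, page-mode writes, reads, and garbage collection are ignored.

```python
def count_write_miss_fetches(num_words, cache_bytes=8192, block_bytes=16,
                             word_bytes=4, subblock=False):
    """Count block fetches caused by write misses during sequential allocation.

    Hypothetical model: direct-mapped, write-allocate cache. With subblock
    placement (one valid bit per word) a write miss claims the block and marks
    only the written word valid, so nothing is fetched. Without it, the rest
    of the block must be fetched from memory before the write can complete.
    """
    num_blocks = cache_bytes // block_bytes
    words_per_block = block_bytes // word_bytes
    tags = [None] * num_blocks
    valid = [[False] * words_per_block for _ in range(num_blocks)]
    fetches = 0

    for w in range(num_words):
        addr = w * word_bytes                  # allocation walks memory sequentially
        block_addr = addr // block_bytes
        index = block_addr % num_blocks        # direct-mapped placement
        offset = (addr % block_bytes) // word_bytes

        if tags[index] != block_addr:          # write miss
            tags[index] = block_addr
            if subblock:
                # Claim the block without fetching; no word is valid yet.
                valid[index] = [False] * words_per_block
            else:
                # Fetch the whole block so that every word becomes valid.
                fetches += 1
                valid[index] = [True] * words_per_block
        valid[index][offset] = True            # the written word is now valid

    return fetches
```

Under these assumptions, initializing 1024 words with 16-byte blocks incurs one fetch per block without subblock placement and none with it, which is the penalty-free allocation described above.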

We found that, for caches with subblock placement, the arithmetic mean of the data cache penalty
was under 9% for 64K or larger caches; for caches without subblock placement, the mean of the
data cache penalty was often higher than 50%. We also found that a cache size of 512K allowed the
allocation area for six of the benchmark programs to fit in the cache, which substantially improved
the performance of cache organizations without subblock placement.

The implications of these results are clear. First, a stack is not needed to achieve good memory
subsystem performance. Given the right memory subsystem, heap allocation of procedure activation
records can also have good memory subsystem performance. Heap allocation can be used
without a performance penalty in place of stack allocation, even though it is a more general storage
management technique. Second, computer architects can better support modern languages which
make heavy use of dynamic storage allocation on machines with small primary caches by using
subblock  placement  with  a  subblock  size  of  1  word. 

[Figures 10 to 15: plots not recoverable from the scan. Each figure has two panels, (a) block size = 16 bytes and (b) block size = 32 bytes, plotting cycles per useful instruction against split I and D cache sizes. Each panel shows curves for the configurations write-no-alloc, no-subblk; write-alloc, subblk; and write-alloc, no-subblk, each at assoc=1 and assoc=2, along with a per-program "nops" curve (for example leroy-nops, lexgen-nops, pia-nops, simple-nops).]

Figure 10: Knuth-Bendix summary

Figure 11: Lexgen summary

Figure 12: Life summary

Figure 13: PIA summary

Figure 14: Simple summary

Figure 15: YACC summary